March 6, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
Thus far, we have only seen deep learning architectures where the information from the inputs flows upwards through the model
Inputs go through layer and yield hidden values
Hidden values to more hidden values
Feedforward in the sense that the only direct influence of a layer is to its child
For tabular data and image analysis:
A single input relates to a single output
Classification \(\rightarrow\) Class
Semantic Segmentation \(\rightarrow\) Segmentation map over the \(H \times W\) image
A single input leads to two (or more) independent outputs
Regression \(\rightarrow\) (Mean, Variance)
Bounding Box \(\rightarrow\) (Class, Location)
Image captioning (One image to a sequence of words)
Other examples?
Video classification (Many images/frames to one video class)
Other examples?
Not necessarily equal number of inputs and outputs
Machine translation (Sequence of words in English to Seq in French)
Sentence continuation (First part of sentence to second part of sentence)
Time series (Stock Prices for the last 10 years to Stock Prices for the next week)
Equal number of outputs and inputs
Per-frame video classification (Seq of images to Seq of class labels)
Per-sentence sentiment analysis (Seq of sentences to Seq of sentiment classes)
Recurrent Neural Networks (RNNs) are the backbone of sequence modeling
Inputs or outputs make sense temporally
Knowing what came before or what comes after should inform what we predict at a current point in time
Adds knowledge to the NN architecture
Units need to talk backwards in time - all of the previous values need to influence what I guess next!
Note, we’re going to use time pretty loosely here - just some sort of meaningful temporal flow
If the order of the input matters, then we should preserve that information!
For now, let’s think in terms of the many-to-many framework:
A sequence of input vectors, \(\mathbf X = \{\mathbf x_1, \mathbf x_2,...,\mathbf x_T \}\), where each vector is of length \(P\)
A sequence of output vectors, \(\mathbf Y = \{\mathbf y_1, \mathbf y_2,...,\mathbf y_T\}\), where each vector is of length \(G\)
Goal: Use \(\mathbf X\) to predict \(\mathbf Y\)
Example: Predict the next character
\[ \mathbf X = [[h],[e],[l],[l]] \]
\[ \mathbf Y = [[e],[l],[l],[o]] \]
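The character example above can be made concrete with one-hot vectors. A minimal sketch (the 4-character vocabulary and helper names are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical 4-character vocabulary for the "hello" example
vocab = ['h', 'e', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(char, size=len(vocab)):
    """Return a one-hot vector for a single character."""
    v = np.zeros(size)
    v[char_to_idx[char]] = 1.0
    return v

X = np.stack([one_hot(c) for c in "hell"])  # inputs,  shape (4, 4)
Y = np.stack([one_hot(c) for c in "ello"])  # targets, shape (4, 4)
```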
We can process a sequence of vectors, \(\mathbf x\), by applying a recurrence formula at every time step:
\[ h_t = f_w(h_{t - 1}, x_t) \]
\(h_{t - 1}\) is the old state (some vector of numbers)
\(x_t\) is the input state at time \(t\) (some vector)
\(f_w()\) is a function of some parameters, \(\mathbf W\) (some collection of vectors)
\(h_t\) is the new state (some vector)
The hidden state vector contains latent information about the current state of the model
For language models, it could tell us about the gender of pronouns in the sentence
Or the passiveness of the tone
This information should persist throughout the sequence of predictions!
Start with an initial state, \(\mathbf h_0\)
Going from \(\mathbf h_0\) to \(\mathbf h_1\):
\[ h_1 = f_w(h_{0}, x_1) \]
The vanilla update:
\[ \underset{(m \times 1)}{\mathbf h_t} = \text{tanh}(\underset{(m \times m)}{\mathbf W_{hh}} \underset{(m \times 1)}{\mathbf h_{t-1}} + \underset{(m \times P)}{\mathbf W_{xh}} \underset{(P \times 1)}{\mathbf x_t} + \underset{(m \times 1)}{\mathbf b_h}) \]
\(\mathbf h_t\) and \(\mathbf h_{t-1}\) are hidden vectors of size \(m\)
\(\mathbf W_{hh}\) is a \(m \times m\) matrix of weights that controls the mapping of one hidden state vector to the next
\(\mathbf W_{xh}\) is a \(m \times P\) matrix of weights that controls the mapping of the input to the hidden state
\(\mathbf b_h\) is a vector of bias terms
\(\text{tanh}()\) is an activation function
Known values:
\(\mathbf h_t\) and \(\mathbf h_{t - 1}\)
Unknown values to be learned:
\(\mathbf W_{hh}\), \(\mathbf W_{xh}\), and \(\mathbf b_h\)
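One step of the vanilla update can be sketched in NumPy. Sizes (\(m = 3\), \(P = 4\)) and the random initialization are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
m, P = 3, 4  # hidden size and input size (illustrative choices)

# Unknown values to be learned (here just randomly initialized)
W_hh = rng.normal(size=(m, m)) * 0.1
W_xh = rng.normal(size=(m, P)) * 0.1
b_h = np.zeros(m)

def rnn_step(h_prev, x_t):
    """One application of h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

h = np.zeros(m)                    # initial state h_0
x = np.array([1.0, 0.0, 0.0, 0.0]) # a one-hot input x_1
h = rnn_step(h, x)                 # h_1, shape (m,)
```

Note that the same `rnn_step` (with the same weights) is applied at every time step.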
The vanilla update:
\[ \underset{(m \times 1)}{\mathbf h_t} = \text{tanh}(\underset{(m \times m)}{\mathbf W_{hh}} \underset{(m \times 1)}{\mathbf h_{t-1}} + \underset{(m \times P)}{\mathbf W_{xh}} \underset{(P \times 1)}{\mathbf x_t} + \underset{(m \times 1)}{\mathbf b_h}) \]
Drawing some parallels to what we’ve seen before:
\(\mathbf h_t\) is like a higher level of hidden values that is a function of the previous set of hidden values passed through an activation function
\(\mathbf W_{hh}\) is the matrix of hidden weights that maps these two hidden values
\(\mathbf W_{xh}\) is a matrix of weights that grounds the new hidden state to its input
A FCNN with a twist!
Tanh is used here over ReLU for a number of different reasons:
RNNs aren't going to promote sparsity, since there isn't a meaningful way to map hidden units to input features
RNNs have a really serious vanishing gradient problem (more to come)
Unlike CNNs, RNNs also have a really serious exploding gradient problem!
Tanh induces nonlinearities without sparsifying the hidden states, and its bounded output helps keep optimization stable
Squish before sparsity here!
We’ll see sigmoids reappear in a little bit!
Note that \(\mathbf x_1\) only relates to \(\mathbf y_1\) through \(\mathbf h_1\)!
Given a hidden state, we say that \(\mathbf y_t\) is determined as:
\[ \underset{(G \times 1)}{\mathbf y_t} = h(\underset{(G \times m)}{\mathbf W_{hy}} \underset{(m \times 1)}{\mathbf h_t} + \underset{(G \times 1)}{\mathbf b_y}) \]
A linear combination of a weight matrix and the hidden state (plus a bias vector)
\(h()\) is a function that maps the linear predictor to the scale of the actual output (softmax for classes, identity for regression)
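A minimal sketch of the output mapping for the classification case, where \(h()\) is the softmax (sizes and initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
G, m = 4, 3  # output and hidden sizes (illustrative)
W_hy = rng.normal(size=(G, m)) * 0.1
b_y = np.zeros(G)

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def output_step(h_t):
    """Map the hidden state to class probabilities: y_t = softmax(W_hy h_t + b_y)."""
    return softmax(W_hy @ h_t + b_y)

y = output_step(np.ones(m))  # a valid probability vector of length G
```

For regression, `softmax` would simply be replaced by the identity.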
The part that is so simple that it’s kinda dumb
Use the same \(\mathbf W_{hh}\), \(\mathbf W_{xh}\), and \(\mathbf W_{hy}\) at all time points
Let’s think about how \(\mathbf y_2\) is computed.
\(\mathbf y_2\) is a function of \(\mathbf h_2\)
\(\mathbf h_2\) is a function of \(\mathbf h_1\) and \(\mathbf x_2\)
\(\mathbf h_1\) is a function of \(\mathbf h_0\) and \(\mathbf x_1\)
The second output in the sequence is determined by
All previous hidden states
All previously seen input vectors!
Let \(\xi()\) be the \(\text{tanh}\) activation function. Dropping bias terms for notational simplicity:
\[\mathbf y_2 = \mathbf W_{hy} \mathbf h_2\]
\[\mathbf h_2 = \xi(\mathbf W_{hh} \mathbf h_1 + \mathbf W_{xh} \mathbf x_2)\]
\[\mathbf h_1 = \xi(\mathbf W_{hh} \mathbf h_0 + \mathbf W_{xh} \mathbf x_1)\]
Stacking it all together:
\[\mathbf y_2 = \mathbf W_{hy} \xi(\mathbf W_{hh} \xi(\mathbf W_{hh} \mathbf h_0 + \mathbf W_{xh} \mathbf x_1) + \mathbf W_{xh} \mathbf x_2)\]
Assuming linear activation functions, we can see how the weights stack:
\[ \mathbf y_2 = \mathbf W_{hy} \mathbf W_{hh} \mathbf W_{hh} \mathbf h_0 + \mathbf W_{hy} \mathbf W_{hh} \mathbf W_{xh} \mathbf x_1 + \mathbf W_{hy} \mathbf W_{xh} \mathbf x_2 \]
\[ \begin{align} \mathbf y_3 = & \mathbf W_{hy} \mathbf W_{xh} \mathbf x_3 + \mathbf W_{hy} \mathbf W_{hh} \mathbf W_{xh} \mathbf x_2 + \mathbf W_{hy} \mathbf W_{hh} \mathbf W_{hh} \mathbf W_{xh} \mathbf x_1 +\\ & \mathbf W_{hy} \mathbf W_{hh} \mathbf W_{hh} \mathbf W_{hh} \mathbf h_0 \end{align} \]
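The closed-form expansion for \(\mathbf y_2\) can be checked numerically: with an identity activation, unrolling the recurrence two steps gives exactly the stacked-weights expression. All matrices here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
m, P, G = 3, 2, 2
W_hh = rng.normal(size=(m, m))
W_xh = rng.normal(size=(m, P))
W_hy = rng.normal(size=(G, m))
h0 = rng.normal(size=m)
x1, x2 = rng.normal(size=P), rng.normal(size=P)

# Unrolled recurrence with a linear (identity) activation
h1 = W_hh @ h0 + W_xh @ x1
h2 = W_hh @ h1 + W_xh @ x2
y2_unrolled = W_hy @ h2

# Closed form: y_2 = W_hy W_hh W_hh h_0 + W_hy W_hh W_xh x_1 + W_hy W_xh x_2
y2_closed = W_hy @ W_hh @ W_hh @ h0 + W_hy @ W_hh @ W_xh @ x1 + W_hy @ W_xh @ x2
```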
If the largest singular value of \(\mathbf W_{hh}\) is less than 1, then the effect of each input diminishes as time goes on!
Small times small (e.g. absolute value less than 1) is even smaller.
The RNN is said to have infinite memory
What it saw before will always play a part (albeit with diminishing influence) in what it predicts next!
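The diminishing influence is easy to see numerically: if \(\mathbf W_{hh}\) is rescaled so its largest singular value is 0.9, the contribution of the initial state shrinks geometrically under repeated application (the matrix and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 4
W = rng.normal(size=(m, m))
W = 0.9 * W / np.linalg.svd(W, compute_uv=False)[0]  # rescale: top singular value is 0.9

h = np.ones(m)
norms = []
for _ in range(50):
    h = W @ h  # repeated linear recurrence with no new input
    norms.append(np.linalg.norm(h))
# Each step can grow the norm by at most 0.9x, so the initial state's
# contribution is bounded by 0.9**t and fades geometrically
```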
What about training a RNN?
Associate each prediction - \(\{\hat{\mathbf y}_1,\hat{\mathbf y}_2,...\}\) - with the actual outcome - \(\{\mathbf y_1,\mathbf y_2,...\}\) - and compute per vector loss
Then, add ’em together
\[ \mathcal L(\hat{\mathbf Y} , \mathbf Y) = \sum \limits_{t = 1}^{T_y} \mathcal L(\hat{\mathbf y}_t , \mathbf y_t) \]
Many-to-One:
One-to-Many:
Same deal over and over again!
RNNs are widely applicable, but not all that different than what we’ve already seen!
Pre-GPT, the seq2seq model was a main approach for question/answer models
This is another example of an encoder/decoder architecture:
Use weights to determine the mapping of the inputs to a final hidden state
Use that hidden state as the beginning of a new mapping to the outcome
Computationally, not that bad!
Let’s look at an aligned many-to-many model!
Goal: Given input characters, build a model to predict the next character!
At test time, sample from the model to generate new text
Right now, a basic one-hot encoded character vector
Text embeddings are often used to pre-process text (an inherently discrete variable) into a lower dimensional continuous space
Same idea as the embeddings you’ve used on PS2!
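An embedding is just a learned lookup table: row \(i\) is the continuous vector for token \(i\). A minimal sketch (the vocabulary size, embedding dimension, and token ids are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
vocab_size, embed_dim = 27, 8  # illustrative: 26 letters + space, 8-dim vectors

# The embedding matrix would normally be learned; here it is random
E = rng.normal(size=(vocab_size, embed_dim))

def embed(token_ids):
    """Map a sequence of integer token ids to their embedding vectors."""
    return E[np.array(token_ids)]

seq = embed([7, 4, 11, 11, 14])  # ids for "hello" under a=0, ..., z=25
# Repeated tokens (the two l's) get identical vectors, by construction
```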
Can RNNs be deep? Sure!
But, there’s a fundamental problem to address first - gradients
Given \(\mathbf X\) and \(\mathbf Y\), we need to learn \(\mathbf W_{hh}\), \(\mathbf W_{xh}\), and \(\mathbf W_{hy}\)
The hidden states \(\{\mathbf h_1,...,\mathbf h_T\}\) are just a function of these values
Backprop through time:
Set initial values for the weights and the initial hidden states
Forward pass through time to compute the hidden state values
Backwards pass through time to compute the gradients
Many-to-One:
The gradient for \(\mathbf W_{hh}\):
\[ \frac{\partial \mathcal L}{\partial \mathbf W_{hh}} = \frac{\partial \mathcal L}{\partial \mathbf h_T}\frac{\partial \mathbf h_{T}}{\partial \mathbf h_{T - 1}}\frac{\partial \mathbf h_{T-1}}{\partial \mathbf h_{T - 2}}...\frac{\partial \mathbf h_{1}}{\partial \mathbf W_{hh}} \]
Let’s assume that our activation function is tanh:
\[ \mathbf h_t = \xi(\mathbf W_{hh} \mathbf h_{t - 1} + \mathbf W_{xh} \mathbf x_t) \]
Then:
\[ \frac{\partial \mathbf h_{t}}{\partial \mathbf h_{t - 1}} = \text{tanh}'(\mathbf W_{hh} \mathbf h_{t-1} + \mathbf W_{xh} \mathbf x_t) \mathbf W_{hh} \]
The derivative of tanh w.r.t. its input is always between 0 and 1!
The gradient for \(\mathbf W_{hh}\):
\[ \frac{\partial \mathcal L}{\partial \mathbf W_{hh}} = \frac{\partial \mathcal L}{\partial \mathbf h_T}\frac{\partial \mathbf h_{T}}{\partial \mathbf h_{T - 1}}\frac{\partial \mathbf h_{T-1}}{\partial \mathbf h_{T - 2}}...\frac{\partial \mathbf h_{1}}{\partial \mathbf W_{hh}} \]
can be expressed more compactly as:
\[ \frac{\partial \mathcal L}{\partial \mathbf W_{hh}} = \frac{\partial \mathcal L}{\partial \mathbf h_T} \frac{\partial \mathbf h_{1}}{\partial \mathbf W_{hh}} \prod \limits_{t = 2}^T \frac{\partial \mathbf h_t}{\partial \mathbf h_{t - 1}} \]
Plugging in what we know:
\[ \frac{\partial \mathcal L}{\partial \mathbf W_{hh}} = \frac{\partial \mathcal L}{\partial \mathbf h_T} \frac{\partial \mathbf h_{1}}{\partial \mathbf W_{hh}} \mathbf W_{hh}^{T - 1} \prod \limits_{t = 2}^T \text{tanh}'(\mathbf W_{hh} \mathbf h_{t-1} + \mathbf W_{xh} \mathbf x_t) \]
The gradient for the recurrence weights is almost surely going to vanish as the number of hidden states gets large!
And not even that many
Any pre-activation values greater than 3 or less than \(-3\) drive \(\text{tanh}'\) to essentially zero and wreck the gradient
Double trouble:
\(\lim_{t \to \infty} \mathbf W_{hh}^t\) depends on the largest singular value of the matrix
If it’s greater than 1, goes to \(\infty\)
If it’s less than 1, goes to zero
There’s no escape…
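The double trouble shows up directly in \(\mathbf W_{hh}^t\). A small NumPy demonstration, using a symmetric matrix so that the norm of the power is exactly the top singular value raised to the power (the matrix itself is a random placeholder):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))
W = (A + A.T) / 2  # symmetric, so ||W^t|| = (top singular value)^t exactly
top = np.linalg.svd(W, compute_uv=False)[0]

def power_norm(scale, t=60):
    """Spectral norm of (scale * W / top)^t: W rescaled so its top singular value is `scale`."""
    M = np.linalg.matrix_power(scale * W / top, t)
    return np.linalg.norm(M, 2)

exploding = power_norm(1.1)  # grows like 1.1**60 (about 300)
vanishing = power_norm(0.9)  # shrinks like 0.9**60 (about 0.002)
```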
A simple solution is called truncated backpropagation
Limit the time window over which the loss gradient can propagate
Say no more than 10 steps in the past
Problem: We’re potentially giving up on important sequence information if we just say that there is no effect of \(\mathbf h_1\) on \(\mathbf h_{11}\)
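The idea behind truncation can be sketched with scalar stand-ins for the per-step Jacobians (this is an illustration of dropping old factors, not a full BPTT implementation):

```python
def grad_factor_product(dh_factors, window):
    """Multiply per-step gradient factors, truncated to the most recent `window` steps.

    Each entry of `dh_factors` is a scalar stand-in for the size of one
    Jacobian dh_t/dh_{t-1}; truncation simply drops factors older than the window.
    """
    out = 1.0
    for f in dh_factors[-window:]:
        out *= f
    return out

factors = [0.8] * 50                             # 50 shrinking steps
full = grad_factor_product(factors, window=50)   # ~0.8**50, effectively zero
trunc = grad_factor_product(factors, window=10)  # ~0.8**10, still a usable signal
```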
A clever solution - The Long Short Term Memory model (LSTM)
There’s a lot going on here. Let’s take it step-by-step.
Remember that the next state at each step of an RNN is determined as:
\[ \underset{(m \times 1)}{\mathbf h_t} = \text{tanh}(\underset{(m \times m)}{\mathbf W_{hh}} \underset{(m \times 1)}{\mathbf h_{t-1}} + \underset{(m \times P)}{\mathbf W_{xh}} \underset{(P \times 1)}{\mathbf x_t} + \underset{(m \times 1}{\mathbf b_h}) \]
Running example:
LSTMs alter the RNN model by having two concurrently running hidden units - the hidden unit and the cell state
The cell state runs straight down the chain with only some minor linear interactions
The cell state is of the same size as the hidden value!
The hidden state can add or remove information from the cell state through gates
A sigmoid layer (which maps a value between 0 and 1) determines when and how much the hidden state should influence the next one!
LSTMs have 3 of these gates to protect and control the cell state.
The first gate is called the forget gate
The forget gate controls how we lose information from the previous memory
In a language model, maybe the hidden state encodes the gender of the subject.
Maybe the subject changes and uses different pronouns?
It’s not worthwhile to remember the gender state in this case.
The sigmoid activation is applied to the weighted hidden state element-wise
The second gate is called the input gate
Two weight matrices - the input weights and the candidate gate
A candidate hidden vector is created using the standard update with a tanh activation, \(\tilde{\mathbf C}_t\)
The input gate applies a value between zero and 1 to each element of the candidate gate that says how much of the new information should be combined with the old info to create the new hidden state
If we’re changing subjects and pronouns, we want a new state that corresponds to the new subject!
The new cell state is a linear combination of the old cell state and the new cell state weighted by the forget weight and the input (new info) weight
Finally, we need to map the cell state to the next hidden value
The output gate creates a filtered version of the cell state
First, a sigmoid activation determines which parts of the old state are going to combine with the cell state to get the new state
Then, we push the cell state through a tanh activation and multiply it by the output gate so that we only keep the parts of the cell state that are useful for the next prediction!
The earlier gates deal with changes to the state, but the output gate determines what parts of the state need to be altered
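Collecting the gates described above into the standard LSTM update equations, with \(\sigma\) the sigmoid, \(\odot\) elementwise multiplication, and \([\mathbf h_{t-1}, \mathbf x_t]\) the concatenation of the previous hidden state and the input:
\[ \begin{align} \mathbf f_t & = \sigma(\mathbf W_f [\mathbf h_{t-1}, \mathbf x_t] + \mathbf b_f) & \text{(forget gate)}\\ \mathbf i_t & = \sigma(\mathbf W_i [\mathbf h_{t-1}, \mathbf x_t] + \mathbf b_i) & \text{(input gate)}\\ \tilde{\mathbf C}_t & = \text{tanh}(\mathbf W_C [\mathbf h_{t-1}, \mathbf x_t] + \mathbf b_C) & \text{(candidate)}\\ \mathbf C_t & = \mathbf f_t \odot \mathbf C_{t-1} + \mathbf i_t \odot \tilde{\mathbf C}_t & \text{(new cell state)}\\ \mathbf o_t & = \sigma(\mathbf W_o [\mathbf h_{t-1}, \mathbf x_t] + \mathbf b_o) & \text{(output gate)}\\ \mathbf h_t & = \mathbf o_t \odot \text{tanh}(\mathbf C_t) & \text{(new hidden state)} \end{align} \]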
What does this accomplish in terms of gradients?
Note that \(\mathbf h_{t-1}\) connects to \(\mathbf h_t\) in the regular way (just a few more sets of weights)
\(\mathbf C_t\) only connects to \(\mathbf C_{t-1}\) through \(\mathbf f_t\)
No direct interaction with \(\mathbf h\) or the weights!
We can think of the cell state as a smooth version of the hidden state that is updated with different small perturbations at each time period
\[ \frac{\partial \mathbf C_t}{\partial \mathbf C_{t - 1}} = \mathbf f_t \]
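Since \(\mathbf C_t = \mathbf f_t \odot \mathbf C_{t-1} + \mathbf i_t \odot \tilde{\mathbf C}_t\) and the second term doesn't involve \(\mathbf C_{t-1}\), the (elementwise) derivative is just \(\mathbf f_t\). A quick finite-difference check with made-up gate values:

```python
import numpy as np

# Illustrative gate values (would normally come from the sigmoid/tanh layers)
f_t = np.array([0.9, 0.5, 0.1])
i_t = np.array([0.3, 0.3, 0.3])
C_tilde = np.array([0.2, -0.4, 0.7])

def cell_update(C_prev):
    """C_t = f_t * C_{t-1} + i_t * C_tilde (elementwise)."""
    return f_t * C_prev + i_t * C_tilde

C_prev = np.array([1.0, -1.0, 0.5])
eps = 1e-6
grad = (cell_update(C_prev + eps) - cell_update(C_prev - eps)) / (2 * eps)
# grad matches f_t elementwise: the gradient flows through the cell state
# scaled only by the forget gate, not by repeated weight multiplications
```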
LSTMs are the default - not a special case
We’re trading ease of training for some expressiveness of the hidden state
Changes are just going to be much more smooth than the vanilla RNN
Most of the time, this is closer to optimal than not
Let’s look at some examples.
RNNs (at least 5 years ago) were the main tool for dealing with sequence problems
A new framework has emerged - multi-head self-attention
The famous transformer architecture
We’ll start discussing this next class.